Abstract


Fat-trees are a class of routing networks for hardware-efficient parallel computation.
This paper presents a randomized algorithm for routing messages on a fat-tree.
The quality of the algorithm is measured in terms of the load factor of a set of messages
to be routed, which is a lower bound on the time required to deliver the
messages. We show that if a set of messages has load factor $\lambda$ on a fat-tree with
n processors, the number of delivery cycles (routing attempts) that the algorithm
requires is $O(\lambda + \lg n \lg\lg n)$ with probability $1 - O(1/n)$. The best previous bound
was $O(\lambda \lg n)$ for the off-line problem where switch settings can be determined in advance.
In a VLSI-like model where hardware cost is equated with physical volume,
the routing algorithm demonstrates that fat-trees are universal routing networks
in the sense that any routing network can be efficiently simulated by a fat-tree of
comparable hardware cost.


1  Introduction 

Fat-trees constitute a class of routing networks for general-purpose parallel computation.
This paper presents a randomized algorithm for routing a set of messages on a fat-tree.
The routing algorithm and its analysis generalize an earlier universality result by
showing, in a three-dimensional VLSI model, that for a given volume of hardware, a
fat-tree is nearly the best routing network that can be built. This universality result had
been proved only for off-line simulations [10], where switch settings can be determined
in advance; this paper extends it to the more interesting on-line case, where messages
are spontaneously generated by processors.

As is illustrated in Figure 1, a fat-tree is a routing network based on Leighton’s
tree-of-meshes graph [8]. A set of n processors is located at the leaves of a complete binary
tree. Each edge of the underlying tree corresponds to two channels of the fat-tree: one
from parent to child, the other from child to parent. Unlike a normal tree interconnection,
which is “skinny all over,” each channel of a fat-tree consists of a bundle of wires. The
number of wires in a channel c is called its capacity, denoted by cap(c). Each internal
node of the fat-tree contains circuitry that switches messages from incoming to outgoing
channels. The capacities of the channels in a fat-tree determine how much hardware is
required to build it, where we measure hardware in terms of three-dimensional volume.
The greater the capacities of the channels, the greater the communication potential,
and also the greater the physical volume of an implementation of the network. The
randomized routing algorithm presented in this paper can be used to show
that a fat-tree with properly chosen channel capacities is a universal network for a given
volume of circuitry.

Figure 1: The organization of a fat-tree. Processors are located at the leaves, and the internal nodes
contain concentrator switches. The capacities of channels increase as we go up the tree.

The problem that a routing algorithm for a volume-universal network must face is
“pin-boundedness,” the bandwidth limitation imposed by surfaces of three-dimensional
regions, a constraint that makes some communication patterns among a set of processors
harder than others. To illustrate this point, consider a three-dimensional region of volume
v containing an n-processor routing network, and consider a plane that cuts through the
region perpendicular to the longest dimension and which divides the set of processors
in half. Suppose each processor sends a message to a processor on the other side of
the cut. Since the cut has cross-sectional area $O(v^{2/3})$, the time required
by the network to deliver all the messages is $\Omega(n/v^{2/3})$. If the processors fill the region
with substantial density (i.e., $v = O(n^{3/2-\epsilon})$, where $\epsilon > 0$), the time required to deliver
the messages is polynomial in n. In contrast, the communication pattern in which each
processor communicates with its nearest neighbor in the region, as in a three-dimensional
systolic array, can be accomplished in constant time.
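
To make the arithmetic of this example explicit, the following short derivation substitutes the density assumption into the cut bound. It reads the cut bound as $\Omega(n/v^{2/3})$ and the density assumption as $v = O(n^{3/2-\epsilon})$, and it ignores constant factors.

$$
\frac{n}{v^{2/3}}
= \Omega\!\left(\frac{n}{\left(n^{3/2-\epsilon}\right)^{2/3}}\right)
= \Omega\!\left(\frac{n}{n^{1-2\epsilon/3}}\right)
= \Omega\!\left(n^{2\epsilon/3}\right)
$$

Under these assumptions the delivery time for the cross-cut pattern grows as a polynomial in n for any fixed $\epsilon > 0$.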

A volume-universal network should be able to simulate the communication of any
(bounded-degree) network of a given volume with at most polylogarithmic degradation
in time, much as traditional universal networks can simulate the communication of any
network of a given number of processors with at most polylogarithmic degradation. The
routing algorithm that we will give to show that fat-trees are volume-universal networks
cannot use traditional randomization techniques such as the one proposed by Valiant in
his seminal paper [15], however. Using Valiant’s technique for routing permutations on a
hypercube, each message is sent to a random intermediate destination, and from there, to
its true destination. Using the analysis above, the expected time for every permutation,
including simple nearest-neighbor communication, is $\Omega(n/v^{2/3})$ because we expect $n/2$
messages to cross the cut when they are sent to their intermediate destinations. Thus, if
Valiant’s technique is used, the simulation of routing networks whose processors densely
fill a given volume causes polynomial degradation in time.

This paper presents and analyzes a randomized algorithm for routing on fat-trees
which shows that fat-trees can efficiently simulate any routing network of a comparable
volume. We present a measure of congestion for a set of messages, called the load factor,
which is a lower bound on the time to route the messages on a fat-tree. We show
that if a set of messages has load factor $\lambda$, our routing algorithm can route them in
$O(\lambda + \lg n \lg\lg n)$ delivery cycles (routing attempts) with high probability. The best
previous bound for a problem of this nature was an $O(\lambda \lg n)$ bound for the off-line
situation where the set of messages is known in advance [10].

The analysis in terms of load factor is not restricted to permutation routing or
situations where each processor can only send or receive a constant number of messages, as
is common in the literature. We consider the general situation where each processor can
send and receive polynomially many messages. Furthermore, we make no assumptions
about the statistical distribution of messages, except insofar as they affect the load factor.
Our routing algorithm also differs from others in the literature in the way randomization
is used. Unlike the algorithms of Valiant [15], Valiant and Brebner [16], Aleliunas [2],
Upfal [U. ] and Pippenger [12], for example, it does not randomize with respect to paths
taken by messages. Instead, for each delivery cycle, each undelivered message randomly
chooses whether to be sent.

The remainder of this paper is organized as follows. Section 2 further describes the
fat-tree network, defines the load factor, and discusses universality. Section 3 presents
the randomized algorithm for efficiently routing messages on the fat-tree network, and
Section 4 contains the full analysis of the algorithm. Section 5 gives an existential lower
bound for the naive greedy approach to routing messages which shows that the greedy
strategy is inferior to the randomized algorithm for worst case inputs. Section 6 contains
a variety of results that follow from the randomized routing algorithm. It shows how
the universality result of [10] can be extended to on-line simulations, and it includes a
modification of the routing algorithm which achieves better bounds when each channel
has capacity $\Omega(\lg n)$. Finally, Section 7 contains some concluding remarks.

2  Fat-trees

This section describes an implementation of a fat-tree routing network, and it shows how
to choose the channel capacities for volume-universal and area-universal fat-trees. We
precisely define the load factor of a set of messages on a general network, which is a lower
bound on the time required to deliver the messages. We give a proof that fat-trees satisfy
a simple universality property, and we indicate why a good message-routing algorithm
based on load factor can strengthen this result.

The implementation of fat-trees described here follows that of [10]. We consider
communication through the fat-tree network to be synchronous, bit serial, and batched. By
synchronous, we mean that the system is globally clocked. By bit serial, we mean that
the messages can be thought of as bit streams. Each message snakes its way through
the wires and switches of the fat-tree, with leading bits of the message setting switches
and establishing a path for the remainder to follow. By batched, we mean the messages
are grouped into delivery cycles. During a delivery cycle, the processors send messages
through the network. Each message attempts to establish a path from its source to
its destination. Since some messages may be unable to establish connections during a
delivery cycle, each successfully delivered message is acknowledged through its
communication path at the end of the cycle. Rather than buffering undelivered messages, we
simply allow them to try again in a subsequent delivery cycle. The routing algorithm is
responsible for grouping the messages into delivery cycles so that all the messages are
delivered in as few cycles as possible.

The mechanics of routing messages in a fat-tree are similar to routing in an ordinary
tree. For each message, there is a unique path from its source processor to its destination
processor in the underlying complete binary tree, which can be specified by a relative
address consisting of at most $2 \lg n$ bits telling whether the message turns left or right
at each internal node.
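
As a minimal sketch of the unique tree path described above, the following Python function lists the channels a message traverses. The heap-style node numbering, the `(node, direction)` channel labels, and the name `channels_on_path` are illustrative assumptions and do not come from the paper; the sketch lists channels rather than encoding the relative address bits.

```python
def channels_on_path(src, dst, n):
    """List the channels on the unique path from leaf src to leaf dst.

    Assumptions (for illustration only): n is a power of two, leaves are
    numbered 0..n-1, and tree nodes use heap numbering (root 1, leaves
    n..2n-1). A channel is labeled (child, "up") or (child, "down"), meaning
    the channel between `child` and its parent in that direction.
    """
    a, b = n + src, n + dst
    up, down = [], []
    # Both leaves sit at the same depth, so climbing in lockstep from each
    # end meets at their least common ancestor.
    while a != b:
        up.append((a, "up"))
        down.append((b, "down"))
        a //= 2
        b //= 2
    # Up channels in climbing order, then down channels from the top.
    return up + down[::-1]


# A message between opposite halves of an 8-leaf tree uses 2 lg n = 6 channels.
print(channels_on_path(0, 5, 8))
```

The path length is twice the height of the least common ancestor, so it never exceeds $2 \lg n$, matching the bound on the address length.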

Within  each  node  of  the  fat-tree,  the  messages  destined  for  a  given  output  channel 
are  concentrated  onto  the  available  wires  of  that  channel.  This  concentration  may  result 
in  “lost”  messages  if  the  number  of  messages  destined  for  the  output  channel  exceeds  the 
capacity  of  the  channel.  We  assume,  however,  that  the  concentrators  within  the  node 
are  ideal  in  the  sense  that  no  messages  are  lost  if  the  number  of  messages  destined  for 
a  channel  is  less  than  or  equal  to  the  capacity  of  the  channel.  Such  a  concentrator  can 
be  built,  for  example,  with  a  logarithmic-depth  sorting  network  [1].  A  somewhat  more 
practical  logarithmic-depth  circuit  can  be  built  by  combining  a  parallel  prefix  circuit  [7] 
with a butterfly (i.e., FFT, Omega) network. With switches of logarithmic depth, the
time to run each delivery cycle is $O(\lg^2 n)$ bit times. (Section 6 contains another fat-tree
design where the time to run a delivery cycle is $O(\lg n)$ bit times.)

The  performance  of  any  routing  algorithm  for  a  fat-tree  depends  on  the  locality 
of  communication  inherent  in  a  set  of  messages.  The  locality  of  communication  for  a 
message set M can be summarized by a measure $\lambda(M)$ called the load factor, which we
define in a more general network setting.

Definition: Let R be a routing network. A set S of wires in R is a (directed)
cut if it partitions the network into two sets of processors A and B such that
every path from a processor in A to a processor in B contains a wire in S.
The capacity cap(S) is the number of wires in the cut. For a set of messages
M, define the load load(M, S) of M on a cut S to be the number of messages
in M that must cross S. The load factor of M on S is

$$\lambda(M, S) = \frac{\mathrm{load}(M, S)}{\mathrm{cap}(S)},$$

and the load factor of M on the entire network R is

$$\lambda(M) = \max_{S} \lambda(M, S).$$

The  load  factor  of  a  set  of  messages  on  a  given  network  provides  a  simple  lower  bound 
on  the  time  required  to  deliver  all  messages  in  the  set.  For  fat-trees,  the  load  factor  of 
a  set  of  messages  is  determined  by  the  cuts  on  the  channels  alone. 

Lemma 1 The load factor of a set M of messages on a fat-tree is

$$\lambda(M) = \max_{c} \lambda(M, c),$$

where c ranges over all channels of the fat-tree. ∎
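
Lemma 1 suggests a direct way to compute the load factor on a fat-tree: count the messages crossing each channel and take the largest ratio of load to capacity. The sketch below does this under illustrative assumptions that are not part of the paper: heap numbering of tree nodes, channels labeled by their lower endpoint and direction, and a hypothetical callback `cap(node)` giving the capacity of each of the two channels between `node` and its parent.

```python
from collections import Counter


def channels_on_path(src, dst, n):
    """Channels between leaves src and dst (heap numbering, root = 1)."""
    a, b, up, down = n + src, n + dst, [], []
    while a != b:
        up.append((a, "up"))
        down.append((b, "down"))
        a, b = a // 2, b // 2
    return up + down[::-1]


def load_factor(messages, n, cap):
    """Maximum over channels of load(M, c) / cap(c), as in Lemma 1.

    messages is a list of (source leaf, destination leaf) pairs; cap(node)
    is an assumed interface returning the capacity of the channels on the
    edge between `node` and its parent.
    """
    load = Counter()
    for src, dst in messages:
        load.update(channels_on_path(src, dst, n))
    return max((k / cap(node) for (node, _), k in load.items()), default=0.0)


msgs = [(0, 2), (1, 3)]                    # both messages cross the root
skinny = lambda node: 1                    # ordinary tree, unit capacities
fat = lambda node: 2 if node < 4 else 1    # capacity 2 just below the root
print(load_factor(msgs, 4, skinny), load_factor(msgs, 4, fat))
```

On the four-leaf example the two messages share the channels next to the root, so doubling the capacity there halves the load factor.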

The randomized routing algorithm for fat-trees presented in Section 3 can deliver a set
M of messages in $O(\lambda(M) + \lg n \lg\lg n)$ delivery cycles with high probability. In fact,
the running time is asymptotically even less for message sets with small load factors.

We are particularly interested in the application of the routing results to universal
fat-trees. In order for a fat-tree to be universal for area, the channel capacities must
be picked properly. One way is to give each leaf channel a constant capacity, and then
grow the channel capacities by a factor of $\sqrt{2}$ at each level as we go up the tree, rounding off to
integer capacities where needed. Another scheme that avoids rounding is to double the
channel capacities every two levels, as is shown in Figure 1. Either of these methods
yields an $O(n \lg^2 n)$-area layout for n processors, and a root capacity of $O(\sqrt{n})$.
Volume-universal fat-trees can be constructed in a similar fashion by picking a growth rate of
$\sqrt[3]{4}$, or equivalently, by quadrupling the capacity every three levels. The volume of an
n-processor fat-tree constructed by these methods is $O(n \lg^{3/2} n)$, and the root capacity
is $O(n^{2/3})$.
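
The two rounding-free capacity schedules can be written down in a few lines. The following is a sketch only: the choice of which levels carry the doubling or quadrupling step, the height convention (0 for leaf channels), and the function names are assumptions for illustration, since Figure 1 is not reproduced here.

```python
def area_universal_cap(height, leaf_cap=1):
    """Capacity doubling every two levels (growth rate sqrt(2) per level).

    height counts levels above the leaves, with 0 for the leaf channels;
    which levels carry the doubling step is an assumption of this sketch.
    """
    return leaf_cap * 2 ** (height // 2)


def volume_universal_cap(height, leaf_cap=1):
    """Capacity quadrupling every three levels (growth rate 4^(1/3) per level)."""
    return leaf_cap * 4 ** (height // 3)


# Capacities from the leaf channels up to the channels just below the root
# of a 4096-processor fat-tree (lg n = 12 levels of channels).
levels = range(12)
print([area_universal_cap(h) for h in levels])
print([volume_universal_cap(h) for h in levels])
```

With these schedules the capacities near the root are within a constant factor of $\sqrt{n}$ and $n^{2/3}$ respectively, in line with the root capacities quoted above.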

Even  without  a  good  routing  algorithm  for  fat-trees,  it  is  possible  to  prove  a  simple 
universality  property. 

Lemma 2 Let R be an interconnection network of area n such that all
connections are point-to-point between processors with no intervening switches.
Then an area-universal fat-tree of area $O(n \lg^2 n)$ can simulate every step of
network R with at most $O(\lg^2 n)$ switching delay.

Proof. We assume without loss of generality that network R lies in a square with
side length $\sqrt{n}$. The layout of the fat-tree has $\sqrt{n}$ processors on a side, and thus each
processor of R can be mapped to the corresponding processor of the fat-tree in the
natural geometric fashion. This mapping satisfies the property that the capacity of any
channel of the fat-tree is proportional to the perimeter of the corresponding region of
the layout of network R. Therefore, any communication step performed by R induces
at most a load factor of 1 on the fat-tree and thus can be routed in one delivery cycle,
which takes $O(\lg^2 n)$ time. ∎

This  universality  result  can  be  strengthened  if  a  good  routing  algorithm  for  fat-trees 
is  known.  For  example,  it  seems  natural  to  consider  networks  with  intermediate  switches 
that might buffer messages for several time steps. Given a set of messages that are
delivered over time on such a network, the load factor induced on channels of the fat-tree
is typically greater than 1. We could model switches as processors, but we would like to
prove a stronger universality theorem without disrupting the abstraction of processors
connected to a routing network. Thus, we must have a routing algorithm that can
directly route message sets with large load factors. Section 3 presents a general routing
algorithm  for  fat-trees  that  routes  messages  quickly  with  high  probability.  Section  6  uses 
this  routing  algorithm  to  show  that  a  fat-tree  of  a  given  volume  with  n  processors  can 
simulate  any  other  n-processor  network  of  the  same  volume  with  only  polylogarithmic 
degradation  in  time. 

3  The  routing  algorithm 

This section gives a randomized algorithm for routing a set M of messages. The algorithm
RANDOM, which is based on routing random subsets of the messages in M, is shown
in Figure 2. It uses the subroutine TRY-GUESS shown in Figure 3. Section 4 provides
a proof that on an n-processor fat-tree, the probability is at least $1 - O(1/n)$ that
RANDOM delivers all messages in M within $O(\lambda(M) + \lg n \lg\lg n)$ delivery cycles, if the
two constants $k_1$ and $k_2$ appearing in the algorithm are properly chosen.

The  basic  idea  of  RANDOM  is  to  pick  a  random  subset  of  messages  to  send  in  each 
delivery  cycle  by  independently  choosing  each  message  with  some  probability  p.  This 
idea  is  sufficiently  important  to  merit  a  formal  definition. 

Definition: A p-subset of M is a subset of M formed by independently
choosing each message of M with probability p.

We  will  show  in  Section  4  that  if  p  is  sufficiently  small,  a  substantial  portion  of  the 
messages  in  a  p-subset  are  delivered  because  they  encounter  no  congestion  during  routing. 
On  the  other  hand,  if  p  is  too  small,  few  messages  are  sent.  RANDOM  varies  the 
probability p from cycle to cycle, seeking random subsets of M that contain a substantial
portion of the messages in M, but that do not cause congestion.
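
Forming a p-subset is a one-line sampling step. The sketch below is illustrative; the function name `p_subset` and the optional `rng` argument are assumptions, and messages are represented as arbitrary Python objects.

```python
import random


def p_subset(messages, p, rng=random):
    """Return a p-subset: each message is kept independently with probability p."""
    return [m for m in messages if rng.random() < p]


rng = random.Random(1)
msgs = list(range(1000))
# With p = 0.1 the subset has about 100 messages on average.
print(len(p_subset(msgs, 0.1, rng)))
```

Small values of p thin out the traffic on every channel at once, while larger values send more messages per cycle; this is the trade-off that RANDOM navigates by varying p.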

The algorithm RANDOM varies the probability p because the load factor $\lambda(M)$ is
not known. The overall structure of RANDOM is to guess the load factor and call the
subroutine TRY-GUESS for each guess. The subroutine TRY-GUESS determines the
probability p based on RANDOM’s guess $\lambda_{\mathrm{guess}}$ and a parameter r, called the congestion
parameter of the fat-tree, which is independent of the message set and which will be
defined in Section 4. If $\lambda_{\mathrm{guess}}$ is an upper bound on the true load factor $\lambda(M)$, each
iteration of the while loop in TRY-GUESS halves the load factor $\lambda(U)$ of the set U of
undelivered messages with high probability, as will be shown in Section 4. When the
loop is finished, we have $\lambda(U) \le 1$, and all the remaining messages can be delivered in
one cycle. The number of delivery cycles performed by TRY-GUESS is $O(\lg \lambda_{\mathrm{guess}} \lg n)$ if
$2 \le \lambda_{\mathrm{guess}} \le \Theta(\lg n)$, and the number of cycles is $O(\lambda_{\mathrm{guess}} + \lg n \lg\lg n)$ if $\lambda_{\mathrm{guess}} = \Omega(\lg n)$.

     1  send M
     2  U ← M − {messages delivered}
     3  λ_guess ← 2
     4  while k1 λ_guess < k2 lg n and U ≠ ∅ do
     5      TRY-GUESS(λ_guess)
     6      λ_guess ← λ_guess²
     7  endwhile
     8  λ_guess ← (k2/k1) lg n lg lg n
     9  while U ≠ ∅ do
    10      TRY-GUESS(λ_guess)
    11      λ_guess ← 2 λ_guess
    12  endwhile

Figure 2: The randomized algorithm RANDOM for delivering a message set M on a fat-tree with n
processors. This algorithm achieves the running times in Figure 4 with high probability if the constants
$k_1$ and $k_2$ are appropriately chosen. Since the load factor $\lambda(M)$ is not known in advance, RANDOM
makes guesses, each one being tried out by the subroutine TRY-GUESS.

    procedure TRY-GUESS(λ_guess)
     1  λ ← λ_guess
     2  while λ > 1 do
     3      for i ← 1 to max{k1 λ, k2 lg n} do
     4          independently send each message of U with probability 1/rλ
     5          U ← U − {messages delivered}
     6      endfor
     7      λ ← λ/2
     8  endwhile
     9  send U
    10  U ← U − {messages delivered}

Figure 3: The subroutine TRY-GUESS used by the algorithm RANDOM, which tries to deliver the set
U of currently undelivered messages. When $\lambda_{\mathrm{guess}} \ge \lambda(U)$, this attempt will be successful with high
probability, if the constants $k_1$ and $k_2$ are appropriately chosen. (The value r is the congestion parameter
of the fat-tree defined in Section 4, which is typically a small constant.) In that case, $\lambda$ is always an upper
bound on $\lambda(U)$, which is at least halved in each iteration of the while loop. When the loop is finished,
$\lambda(U) \le 1$, so all the remaining messages can be sent.

    load factor                                    delivery cycles

    $0 \le \lambda(M) \le 1$                       $1$
    $1 < \lambda(M) \le 2$                         $O(\lg n)$
    $2 < \lambda(M) \le \lg n \lg\lg n$            $O(\lg n \lg \lambda(M))$
    $\lg n \lg\lg n < \lambda(M) \le n^{O(1)}$     $O(\lambda(M))$

Figure 4: The number of delivery cycles required to deliver a message set M on a fat-tree with n
processors. All bounds are achieved with probability $1 - O(1/n)$.
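
The control flow of RANDOM and TRY-GUESS in Figures 2 and 3 can be exercised with a small simulation. The sketch below is not the paper's implementation and rests on several labeled assumptions: nodes use heap numbering, `cap(node)` is a hypothetical capacity callback, loop bounds are rounded up to integers, n is at least 4, the values of `r`, `k1`, and `k2` are arbitrary rather than the constants required by the analysis, and a delivery cycle is modeled pessimistically so that a message is delivered only if no channel on its path is congested (real concentrators would still pass cap(c) messages through a congested channel).

```python
import math
import random
from collections import Counter


def channels_on_path(src, dst, n):
    """Channels between leaves src and dst (heap numbering, root = 1)."""
    a, b, up, down = n + src, n + dst, [], []
    while a != b:
        up.append((a, "up"))
        down.append((b, "down"))
        a, b = a // 2, b // 2
    return up + down[::-1]


def run_cycle(sent, paths, cap):
    """One delivery cycle under the pessimistic model described above."""
    load = Counter(ch for m in sent for ch in paths[m])
    return {m for m in sent
            if all(load[ch] <= cap(ch[0]) for ch in paths[m])}


def try_guess(U, guess, paths, n, cap, r, k1, k2, rng):
    """Figure 3: removes delivered messages from U, returns cycles used."""
    lam, cycles = guess, 0
    while lam > 1:
        for _ in range(math.ceil(max(k1 * lam, k2 * math.log2(n)))):
            sent = [m for m in U if rng.random() < 1 / (r * lam)]
            U -= run_cycle(sent, paths, cap)
            cycles += 1
        lam /= 2
    U -= run_cycle(list(U), paths, cap)   # final "send U"
    return cycles + 1


def random_route(messages, n, cap, r=4.0, k1=1.0, k2=4.0, seed=0):
    """Figure 2: returns the total number of delivery cycles (n >= 4)."""
    rng = random.Random(seed)
    paths = [channels_on_path(s, d, n) for s, d in messages]
    U = set(range(len(messages)))
    U -= run_cycle(list(U), paths, cap)   # line 1: send M
    cycles, lg_n = 1, math.log2(n)
    guess = 2.0
    while k1 * guess < k2 * lg_n and U:   # first phase: square the guess
        cycles += try_guess(U, guess, paths, n, cap, r, k1, k2, rng)
        guess = guess ** 2
    guess = (k2 / k1) * lg_n * math.log2(lg_n)
    while U:                              # second phase: double the guess
        cycles += try_guess(U, guess, paths, n, cap, r, k1, k2, rng)
        guess *= 2
    return cycles


unit = lambda node: 1                     # skinny tree, unit capacities
crossing = [(i, (i + 4) % 8) for i in range(8)]
print(random_route(crossing, 8, unit))
```

A message set with load factor at most 1 is delivered by the very first send, so `random_route` returns 1 for it; the crossing pattern above has load factor 4 on unit capacities and needs the guessing phases.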

RANDOM must make judicious guesses for the load factor because TRY-GUESS
may not be effective if the guess is smaller than the true load factor. Conversely, if the
guess is too large, too many delivery cycles will be performed. Since the amount of work
done by TRY-GUESS grows as $\lg \lambda_{\mathrm{guess}}$ for $\lambda_{\mathrm{guess}}$ small, and as $\lambda_{\mathrm{guess}}$ for $\lambda_{\mathrm{guess}}$ large, there
are two main phases to RANDOM’s guessing. (These phases follow the handling of very
small load factors, i.e., $\lambda(M) \le 2$.)

In the first phase, the guesses are squared from one trial to the next. Once $\lambda_{\mathrm{guess}}$
is sufficiently large, we move into the second phase, and the guesses are doubled from
one  trial  to  the  next.  In  each  phase,  the  number  of  delivery  cycles  run  by  TRY-GUESS 
from  one  call  to  the  next  forms  a  geometric  series.  Thus,  the  work  done  in  any  call  to 
TRY-GUESS  is  only  a  constant  factor  times  all  the  work  done  prior  to  the  call.  With 
this  guessing  strategy,  we  can  deliver  a  message  set  using  only  a  constant  factor  more 
delivery  cycles  than  would  be  required  if  we  knew  the  load  factor  in  advance. 
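
The geometric growth of the work per call can be checked by counting the delivery cycles that a single call to TRY-GUESS performs for a given guess. The sketch below follows the loop structure of Figure 3; rounding the loop bound up to an integer and the default values of `k1` and `k2` are assumptions made for illustration.

```python
import math


def try_guess_cycles(guess, n, k1=1.0, k2=1.0):
    """Delivery cycles used by one call TRY-GUESS(guess) on n processors."""
    lam, cycles = guess, 0
    while lam > 1:
        # Inner for loop: max{k1 * lam, k2 * lg n} cycles, rounded up.
        cycles += math.ceil(max(k1 * lam, k2 * math.log2(n)))
        lam /= 2
    return cycles + 1  # the final "send U"


n = 2 ** 16
# First phase: squaring the guess roughly doubles the work per call.
print([try_guess_cycles(g, n) for g in (2, 4, 16)])
# Second phase: doubling a large guess roughly doubles the work per call.
print([try_guess_cycles(g, n) for g in (256, 512, 1024)])
```

In both phases each call costs about a constant factor more than the previous one, which is why the total over all guesses is dominated by the last call.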

4  Analysis  of  the  routing  algorithm  RANDOM 

This section contains the analysis of RANDOM, the routing algorithm for fat-trees
presented in Section 3. We shall show that the probability is $1 - O(1/n)$ that RANDOM
delivers a set M of messages on a universal fat-tree with n processors in the number of
delivery cycles given by Figure 4. This may be summarized as $O(\lambda(M) + \lg n \lg\lg n)$
delivery cycles for all message sets.

We  begin  by  stating  two  technical  lemmas  concerning  basic  probability.  One  is  a 
combinatorial  bound  on  the  tail  of  the  binomial  distribution  of  the  kind  attributed  to 
Chernoff  [4],  and  the  other  is  a  more  general,  but  weaker,  bound  on  the  probability  that 
a  random  variable  takes  on  values  smaller  than  the  expectation. 

The first lemma is the Chernoff bound. Consider t independent Bernoulli trials, each
with probability p of success. It is well known [5] that the probability that there are at
least s successes out of the t trials is

$$B(s, t, p) = \sum_{k=s}^{t} \binom{t}{k} p^{k} (1-p)^{t-k}.$$

The lemma bounds the probability that the number of successes is larger than the
expectation pt.

Lemma 3

$$B(s, t, p) \le \left(\frac{ept}{s}\right)^{s}.$$ ∎
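
A quick numerical check can make the Chernoff-style bound concrete. The sketch below assumes that Lemma 3 is the standard tail bound $B(s,t,p) \le (ept/s)^s$, where $B(s,t,p)$ is the probability of at least s successes in t independent trials with success probability p; the function names are illustrative.

```python
import math


def binomial_tail(s, t, p):
    """B(s, t, p): probability of at least s successes in t Bernoulli(p) trials."""
    return sum(math.comb(t, k) * p ** k * (1 - p) ** (t - k)
               for k in range(s, t + 1))


def chernoff_bound(s, t, p):
    """The bound (e p t / s)^s on the tail probability."""
    return (math.e * p * t / s) ** s


# The bound is only informative when s is well above the expectation p * t.
for s, t, p in [(5, 20, 0.1), (10, 20, 0.1), (8, 64, 0.01)]:
    print(s, t, p, binomial_tail(s, t, p), chernoff_bound(s, t, p))
```

In the proof of Lemma 5 the bound is applied with s equal to a channel capacity, which is why the chance of congestion falls off exponentially in the capacity.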

The second technical lemma bounds the probability that a bounded random variable
takes on values smaller than the expectation.

Lemma 4 Let $X \le b$ be a random variable with expectation $\mu$. Then for any
$w < \mu$, we have

$$\Pr\{X \le w\} \le 1 - \frac{\mu - w}{b - w}.$$ ∎
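
One way to see why a bound of this form holds is Markov's inequality applied to the nonnegative variable $b - X$. The derivation below is a sketch that reads Lemma 4 as $\Pr\{X \le w\} \le 1 - (\mu - w)/(b - w)$ for $X \le b$ with expectation $\mu$ and $w < \mu$; it is not taken from the paper.

$$
\Pr\{X \le w\} = \Pr\{b - X \ge b - w\}
\le \frac{\mathrm{E}[\,b - X\,]}{b - w}
= \frac{b - \mu}{b - w}
= 1 - \frac{\mu - w}{b - w}
$$

The step from the probability to the ratio uses $b - X \ge 0$ and $b - w > 0$, both of which follow from $X \le b$ and $w < \mu \le b$.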

We now analyze the routing of a p-subset M' of a set M of messages. If the number
load(M', c) of messages in M' that must pass through c is no more than the capacity
cap(c), then no messages are lost by concentrating the messages into c. We shall say that
c is congested by M' if load(M', c) > cap(c). The next lemma shows that the likelihood
of channel congestion decreases exponentially with channel capacity if the probability of
choosing a given message in M is sufficiently small.

Lemma 5 Let M be a set of messages on a fat-tree, let $\lambda(M)$ be the load
factor on the fat-tree due to M, let M' be a p-subset of messages from M,
and let c be a channel through which a given message $m \in M'$ must pass.
Then the probability is at most $(ep\lambda(M))^{\mathrm{cap}(c)}$ that channel c is congested by
M'.

Proof. Channel c is congested by M' if load(M', c) > cap(c). There is already one
message from the set M' going through channel c, so we must determine a bound on
the probability that at least cap(c) other messages go through c. Using Lemma 3 with
s = cap(c) and t = load(M, c), the probability that the number of messages sent through
channel c is greater than the capacity cap(c) is less than

$$B(\mathrm{cap}(c), \mathrm{load}(M, c), p) \le \left(\frac{ep\,\mathrm{load}(M, c)}{\mathrm{cap}(c)}\right)^{\mathrm{cap}(c)} \le (ep\lambda(M))^{\mathrm{cap}(c)}.$$ ∎

The next lemma analyzes the probability that a given message of a p-subset of M gets
delivered. In order to do the analysis, however, we must select p small enough so that it
is likely that the message passes exclusively through uncongested channels. The choice
of p depends on the capacities of channels in the fat-tree. For convenience, we define one
parameter of the capacities which will enable us to choose a suitable upper bound for p.

Definition: The congestion parameter r of a fat-tree is the smallest positive
value such that for each simple path $c_1, c_2, \ldots, c_l$ of channels in the fat-tree,
we have

$$\sum_{k=1}^{l} \left(\frac{e}{r}\right)^{\mathrm{cap}(c_k)} \le \frac{1}{2}.$$

For a fat-tree based on a complete binary tree, the longest simple path has length at most
$2 \lg n$, where n is the number of processors, and thus $r \le 4e \lg n$. For universal
fat-trees, the congestion parameter is a constant because the capacities of channels grow
exponentially as we go up the tree. (All we really need is arithmetic growth in the
channel capacities.) The congestion parameter is also constant for any fat-tree based
on a complete binary tree if all the channels have capacity $\Omega(\lg\lg n)$. The remaining
analysis treats the congestion parameter r as a constant, but the analysis does not
change substantially for other cases.
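
The congestion parameter can be computed numerically for a concrete capacity schedule. The sketch below assumes that capacities depend only on the height of a channel and are integers of at least 1, reads the definition as requiring $\sum_k (e/r)^{\mathrm{cap}(c_k)} \le 1/2$ on every simple path, and uses the fact that the worst simple path then runs from a leaf to the root and back down, using one up channel and one down channel at each height. The function names and the bisection tolerance are illustrative choices.

```python
import math


def worst_path_sum(r, caps):
    """Sum of (e/r)^cap over a leaf-to-root-to-leaf path.

    caps[h] is the capacity of the channels at height h (0 = leaf channels);
    the path uses one up channel and one down channel at every height.
    """
    return 2 * sum((math.e / r) ** c for c in caps)


def congestion_parameter(caps, tol=1e-9):
    """Smallest r with worst_path_sum(r, caps) <= 1/2, found by bisection."""
    # The sum decreases in r for r > e, and r = 4e * len(caps) always works
    # when every capacity is at least 1.
    lo, hi = math.e, 4 * math.e * len(caps)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if worst_path_sum(mid, caps) <= 0.5:
            hi = mid
        else:
            lo = mid
    return hi


unit = [1] * 12                              # ordinary tree, lg n = 12
fat = [2 ** (h // 2) for h in range(12)]     # doubling every two levels
print(congestion_parameter(unit), congestion_parameter(fat))
```

For unit capacities the result matches the $4e \lg n$ bound exactly, while the doubling schedule yields a much smaller value that stays bounded as the tree grows.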

We  now  present  the  lemma  that  analyzes  the  probability  that  a  given  message  gets 
delivered. 


Lemma 6 Let M be a set of messages on a fat-tree which has congestion
parameter r, let $\lambda(M)$ be the load factor on the fat-tree due to M, and let
m be an arbitrary message in M. Suppose M' is a p-subset of M, where
$p \le 1/r\lambda(M)$. Then if M' is sent, the probability that m gets delivered is at
least $\frac{1}{2}p$.


Proof. The probability that $m \in M$ is delivered is at least the probability that $m \in M'$
times the probability that m passes exclusively through uncongested channels. The
probability that $m \in M'$ is p, and thus we need only show that, given $m \in M'$, the probability
is at least $\frac{1}{2}$ that every channel through which m must pass is uncongested. Let $c_1, c_2,
\ldots, c_l$ be the channels in the fat-tree through which m must pass. Since $p \le 1/r\lambda(M)$, the
probability that channel $c_k$ is congested is less than $(e/r)^{\mathrm{cap}(c_k)}$ by Lemma 5. The probability
that at least one of the channels is congested is, therefore, less than

$$\sum_{k=1}^{l} \left(\frac{e}{r}\right)^{\mathrm{cap}(c_k)} \le \frac{1}{2}$$

by definition of the congestion parameter. Thus, the probability that none of the channels
are congested is at least $\frac{1}{2}$. ∎

We now focus our attention on RANDOM itself. The next lemma analyzes the
innermost loop (lines 3-6) of RANDOM’s subroutine TRY-GUESS. At this point in the
algorithm, there is a set U of undelivered messages and a value for $\lambda$. The lemma shows
that if $\lambda$ is indeed an upper bound on the load factor $\lambda(U)$ of the undelivered messages
when the loop begins, then $\lambda/2$ is an upper bound after the loop terminates. This lemma
is the crucial step in showing that RANDOM works.


Lemma 7 Let U be a set of messages on an n-processor fat-tree with congestion
parameter r, and assume $\lambda(U) \le \lambda$. Then after lines 3-6 of RANDOM’s
subroutine TRY-GUESS, the probability is at most $O(1/n^2)$ that $\lambda(U) > \frac{1}{2}\lambda$.

Proof. The idea of the proof is to show that the load factor of an arbitrary channel c
remains larger than $\frac{1}{2}\lambda$ with probability $O(1/n^3)$. Since the channel c is chosen arbitrarily
out of the $4n - 2$ channels in the fat-tree, the probability is at most $O(1/n^2)$ that any of
the channels is left with load factor larger than $\frac{1}{2}\lambda$.

For convenience, let C be the subset of messages that must pass through channel c
and are undelivered at the beginning of the innermost loop in RANDOM. Let $C_0 = C$,
and for $i \ge 1$, let $C_i \subseteq C_{i-1}$ denote the set of undelivered messages at the end of the
ith iteration of the loop. Notice that we have $\lambda(C_i, c) = |C_i|/\mathrm{cap}(c)$, since we have
$|C_i| = \mathrm{load}(C_i, c)$ by definition.

We now show there exist values for the constants $k_1$ and $k_2$ in line 3 of TRY-GUESS
such that for $z = \max\{k_1\lambda, k_2 \lg n\}$, the probability is $O(1/n^3)$ that $\lambda(C_z, c) > \frac{1}{2}\lambda$, or
equivalently, that

$$|C_z| > \tfrac{1}{2}\lambda\,\mathrm{cap}(c). \qquad (1)$$

It suffices to prove that the probability is $O(1/n^3)$ that fewer than $\frac{1}{2}|C|$ messages
from C are delivered during the z cycles under the assumption that $|C_i| > \frac{1}{2}\lambda\,\mathrm{cap}(c)$ for
$i = 0, 1, \ldots, z-1$. The intuition behind the assumption $|C_i| > \frac{1}{2}\lambda\,\mathrm{cap}(c)$ is that otherwise,
the load factor on channel c is already at most $\frac{1}{2}\lambda$ at this step of the iteration. The reason
we need only bound the probability that fewer than $\frac{1}{2}|C|$ messages are delivered during
the z cycles is that inequality (1) implies that the number of messages delivered is fewer
than $|C| - \frac{1}{2}\lambda\,\mathrm{cap}(c) \le |C| - \frac{1}{2}\lambda(C, c)\,\mathrm{cap}(c) = \frac{1}{2}|C|$.

We shall establish the $O(1/n^3)$ bound on the probability that at most $\frac{1}{2}|C|$ messages
are delivered in two steps. For convenience, we shall call a cycle good if at least
$\mathrm{cap}(c)/8r$ messages are delivered, and bad otherwise. In the first step, we bound the
probability that a given cycle is bad. Using Lemma 6 with
$p = 1/r\lambda \le 1/r\lambda(U) \le 1/r\lambda(C_i)$ in conjunction with the assumption that
$|C_i| > \frac{1}{2}\lambda\,\mathrm{cap}(c)$, we can conclude that the expected number of messages delivered
in any given cycle is greater than
$\frac{1}{2r\lambda} \cdot \frac{1}{2}\lambda\,\mathrm{cap}(c) = \mathrm{cap}(c)/4r$. Then by Lemma 4, the probability
that a given cycle is bad is at most $1 - 1/(8r - 1) < 1 - 1/8r$.
(Although this bound is sufficiently strong to prove our theoretical results, it is weak
because the probability that a message is delivered in a given cycle is not independent
from the probabilities for other messages, and thus we must rely on the bound given by
Lemma 4. In practice, one would anticipate that the dependencies are weak, and that
the algorithm would be effective with much smaller values for the constants $k_1$ and $k_2$
than we prove here.)
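
The arithmetic behind the bad-cycle bound can be checked with a short sketch. Lemma 4 is
not restated in this section, so the sketch assumes it is the usual reverse Markov
inequality for a random variable $X$ with $0 \le X \le b$ and mean $\mu$, namely
$\Pr\{X \le w\} \le 1 - (\mu - w)/(b - w)$. The function name `bad_cycle_bound` and the choice
$b = \mathrm{cap}(c)$ are illustrative assumptions, not part of the original analysis.

```python
from fractions import Fraction


def bad_cycle_bound(r, cap=1):
    """Upper bound on Pr{cycle is bad} under an assumed reverse Markov inequality.

    Assumptions (illustrative): at most `cap` messages cross the channel in one
    cycle, the expected number delivered is at least cap/(4r), and a cycle is
    bad when fewer than cap/(8r) messages are delivered.
    """
    b = Fraction(cap)            # largest possible number of deliveries per cycle
    mu = Fraction(cap, 4 * r)    # lower bound on the expected deliveries
    w = Fraction(cap, 8 * r)     # threshold separating good cycles from bad ones
    return 1 - (mu - w) / (b - w)


for r in (1, 2, 5, 40):
    bound = bad_cycle_bound(r)
    assert bound == 1 - Fraction(1, 8 * r - 1)   # the value quoted in the proof
    assert bound < 1 - Fraction(1, 8 * r)        # the weaker, simpler form
```

Exact rational arithmetic is used so that the comparison with $1 - 1/(8r-1)$ is an equality
check rather than a floating-point approximation.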

The second step bounds the probability that a substantial fraction of the $z$ delivery
cycles are bad. Specifically, we show that the probability is $1 - O(1/n^3)$ that at least
some small constant fraction $q$ of the $z$ cycles are good. By picking $k_1 = 4r/q$, which
implies $z \ge 4r\lambda/q$, at least $qz\,\mathrm{cap}(c)/8r \ge \frac{1}{2}|C|$ messages will be delivered. We
bound the probability that at least $(1 - q)z$ of the $z$ cycles are bad by using a counting
argument. There are $\binom{z}{(1-q)z}$ ways of picking the bad cycles, and the probability that a
cycle is bad is at most $1 - 1/8r$. Thus, the probability that at most $\frac{1}{2}|C|$ messages are
delivered is

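$$\Pr\{\le \tfrac{1}{2}|C| \text{ messages delivered}\} \;\le\; \binom{z}{(1-q)z}\left(1 - \frac{1}{8r}\right)^{(1-q)z} \;\le\; 2^{-z/12r}$$

if we choose $q = 1/(e^4 r \ln r)$, as the reader may verify. Since $z = \max\{k_1\lambda, k_2 \lg n\}$,
if we choose $k_2 = 36r$, the probability that fewer than $\frac{1}{2}|C|$ messages are delivered is
at most $1/n^3$. ∎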
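
As a numerical sanity check on the choice $k_2 = 36r$, the following sketch evaluates the
per-channel bound $2^{-z/12r}$ and the union bound over the $4n - 2$ channels. The function
name `per_channel_failure` and the sample values of $n$, $r$, $\lambda$, and $k_1$ are
hypothetical and chosen only for illustration.

```python
import math


def per_channel_failure(n, r, lam, k1, k2):
    """Evaluate 2^(-z/(12 r)) with z = max(k1 * lam, k2 * lg n)."""
    z = max(k1 * lam, k2 * math.log2(n))
    return 2.0 ** (-z / (12 * r))


n, r = 1024, 4
p = per_channel_failure(n, r, lam=1, k1=1, k2=36 * r)

# With k2 = 36r we have z >= 36 r lg n, so the bound is at most 2^(-3 lg n) = n^(-3).
assert p <= n ** -3 * (1 + 1e-9)

# Union bound over all 4n - 2 channels gives O(1/n^2) for the whole fat-tree.
assert (4 * n - 2) * p <= 4 * n ** -2
```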

Now  we  can  analyze  RANDOM  as  a  whole. 


Theorem 8  For any message set $M$ on an $n$-processor fat-tree, the probability
is at least $1 - O(1/n)$ that RANDOM will deliver all the messages of
$M$ within the number of delivery cycles specified by Figure 4.

Proof. First, we will show that if $\lambda_{guess} \ge \lambda(M)$, the probability is at most $O(1/n)$
that the loop in lines 2 through 8 of TRY-GUESS fails to yield $\lambda(U) \le 1$. Initially,
$\lambda \ge \lambda(U)$, and we know from Lemma 7 that the probability is at most $O(1/n^2)$ that any
given iteration of the loop fails to restore this condition as $\lambda$ is halved. Since there are
$\lg \lambda_{guess}$ iterations of the loop, we need only make the reasonable assumption that
$\lambda_{guess}$ is polynomial in $n$ to obtain a probability of at most $O(1/n)$ that $\lambda(U)$ remains
greater than 1 after all the iterations of the loop.

Now we just need to count the number of delivery cycles that have been completed
by the time we call TRY-GUESS with a $\lambda_{guess}$ such that $\lambda(M) \le \lambda_{guess}$. Let us denote by
$\lambda^*_{guess}$ the first $\lambda_{guess}$ that satisfies this condition, and then break the analysis down
into cases according to the value of $\lambda(M)$.

For $\lambda(M) \le 1$, we do not actually even call TRY-GUESS. We need only count the
one delivery cycle executed in line 1 of RANDOM.

For $1 < \lambda(M) \le 2$, we need add only the $k_2 \lg n$ cycles executed when we call
TRY-GUESS(2).

For $2 < \lambda(M) \le (k_2/k_1) \lg n$, the number of delivery cycles involved in each execution
of TRY-GUESS is $O(\lg \lambda_{guess} \lg n)$, since we perform $O(\lg \lambda_{guess})$ iterations of the loop
in lines 2-8 of TRY-GUESS, each containing $k_2 \lg n$ iterations of the loop in lines 3-6.
The value of $\lambda^*_{guess}$ is at most $(\lambda(M))^2$, so the number of delivery cycles is
$O(\lg n \lg (\lambda(M))^2)$ for the last guess, $O(\lg n \lg \lambda(M))$ for the second-to-last guess,
$O(\lg n \lg \sqrt{\lambda(M)})$ for the third-to-last guess, and so on. The total number of delivery
cycles is, therefore,

$$\sum_{0 \le i \le 1 + \lg\lg\lambda(M)} O\bigl(\lg n \lg (\lambda(M))^{2^{1-i}}\bigr) = \sum_{0 \le i \le 1 + \lg\lg\lambda(M)} O\bigl(2^{1-i} \lg n \lg \lambda(M)\bigr) = O(\lg n \lg \lambda(M)),$$

since the series is geometric.
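
The geometric behavior of this sum can be illustrated with a small sketch. It assumes,
as the bound $\lambda^*_{guess} \le (\lambda(M))^2$ suggests, that successive guesses in this range are
obtained by squaring (2, 4, 16, ...), and it charges each call $k_2 \lg n$ cycles per halving
iteration. The function name `squaring_phase_cycles` is hypothetical.

```python
import math


def squaring_phase_cycles(lam_m, n, k2):
    """Total cycles over guesses 2, 4, 16, ... up to the first guess >= lam_m.

    Assumes 2 < lam_m and that a call with guess g costs k2 * lg n * lg g cycles
    (lg g halving iterations of k2 * lg n delivery cycles each).
    Returns (total cycles, final guess).
    """
    lg_n = math.log2(n)
    total, guess = 0.0, 2
    while True:
        total += k2 * lg_n * math.log2(guess)
        if guess >= lam_m:
            return total, guess
        guess *= guess   # square the guess, doubling lg(guess)


total, last = squaring_phase_cycles(lam_m=10, n=2 ** 16, k2=1)
assert last <= 10 ** 2                        # the final guess is at most lam_m squared
assert total <= 4 * 16 * math.log2(10)        # within 4 * k2 * lg n * lg lam_m
```

Because the cost doubles with each guess, the last call accounts for more than half of
the total, which is the content of the geometric-series remark.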

For $\lambda(M) > (k_2/k_1) \lg n$, the number of delivery cycles executed by the time we reach
line 8 of RANDOM is $O(\lg n \lg\lg n)$ according to the preceding analysis, and then we
must continue in the quest to reach $\lambda^*_{guess}$. If $\lambda(M) \le (k_2/k_1) \lg n \lg\lg n$, then we need
only add the $O(\lg n \lg\lg n) = O(\lg n \lg \lambda(M))$ delivery cycles involved in the single call
TRY-GUESS($(k_2/k_1) \lg n \lg\lg n$).


If $\lambda(M) > (k_2/k_1) \lg n \lg\lg n$, the number of delivery cycles executed before reaching
line 8 is $O(\lg n \lg\lg n)$ as before, which is $O(\lambda(M))$. We must then add $O(\lambda_{guess})$ cycles
for each call of TRY-GUESS in line 10. Since $\lambda^*_{guess}$ is at most $2\lambda(M)$, the total additional
number of delivery cycles is

$$\sum_{0 \le i \le t} O\bigl(2^{1-i}\lambda(M)\bigr) = O(\lambda(M)),$$

where $t = 1 + \lg\bigl(k_1\lambda(M)/(k_2 \lg n \lg\lg n)\bigr)$. The total number of delivery cycles is thus
$O(\lambda(M))$. ∎
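
The per-call costs used in this case analysis can be sketched as follows. Figure 4 is not
reproduced here, so the sketch assumes only what the proof states: the outer loop of
TRY-GUESS halves $\lambda$ starting from $\lambda_{guess}$, and each iteration runs
$z = \max\{k_1\lambda, k_2 \lg n\}$ delivery cycles. The function name `try_guess_cycles` and the
sample constants are hypothetical.

```python
import math


def try_guess_cycles(lam_guess, n, k1, k2):
    """Delivery cycles used by one call, under the assumptions stated above."""
    lg_n = math.log2(n)
    cycles, lam = 0.0, float(lam_guess)
    while lam > 1:                       # lg(lam_guess) iterations for a power of 2
        cycles += max(k1 * lam, k2 * lg_n)
        lam /= 2                         # the guess for the load factor is halved
    return cycles


# A guess of 2 costs a single iteration of k2 * lg n cycles.
assert try_guess_cycles(2, 1024, k1=1, k2=36) == 360.0

# For a large guess the cost is dominated by the geometric sum of k1 * lam terms.
big = try_guess_cycles(2 ** 20, 1024, k1=1, k2=36)
assert big <= 2 * 2 ** 20 + 36 * 10 * 20
```

The second assertion reflects the bound $O(\lambda_{guess} + \lg n \lg \lambda_{guess})$ on a single call,
which reduces to $O(\lambda_{guess})$ once the guess exceeds $(k_2/k_1) \lg n \lg\lg n$.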

The $1 - O(1/n)$ bound on the probability that RANDOM delivers all the messages
can be improved to $1 - O(1/n^k)$ for any constant $k$ by choosing $k_2 = 12(k + 2)r$, or by
simply running the algorithm through more choices of $\lambda_{guess}$.

We can also use RANDOM to obtain a routing algorithm which guarantees to deliver
all the messages in finite time with expected number of delivery cycles given in Figure 4.
We simply interleave RANDOM with any routing strategy that guarantees to deliver
at least one message in each delivery cycle. If the number of messages is bounded by
some polynomial $n^k$, then we choose $k_2$ such that RANDOM works with probability
$1 - O(1/n^k)$.

5  Greedy  strategies 

It is natural to wonder whether a simple greedy strategy of sending all undelivered
messages on each delivery cycle, and letting them battle their ways through the switches,
might be as effective as RANDOM, which we have shown to work well on every message
set. As a practical matter, a greedy strategy may be a good choice, but it seems difficult
to obtain tight bounds on the running time of greedy strategies, and in fact, we can show
that no naive greedy strategy works as well as RANDOM in terms of asymptotic running
times. For simplicity, we restrict our proof to deterministic strategies and comment later
on the extension to the probabilistic case. Specifically, we show that for a wide class of
deterministic greedy strategies, there exist $n$-processor fat-trees and message sets with
load factor $\lambda$ such that $\Omega(\lambda \lg n)$ delivery cycles are required. This lower-bound result is
based on an idea originally due to M. Maley [11].

Figure 5 shows the greedy algorithm. The code for GREEDY does not completely
specify the behavior of message routing on a fat-tree because the switches have a choice
as to which messages to drop when there is congestion. (The processors also have this
choice, but we shall think of them as being switches as well.) In the analysis of RANDOM,
we could presume that all messages in a channel were lost if the channel was congested.
To completely specify the behavior of GREEDY, we must define the behavior of switches
when channels are congested.

    while M ≠ ∅ do
        send M
        M ← M - {messages delivered}
    endwhile

Figure 5: The algorithm GREEDY for delivering a message set M. This algorithm repeatedly sends all
undelivered messages. The performance is highly dependent on the behavior of the switches.

The lower bound for GREEDY covers a wide range of switch behaviors. Specifically,
we assume the switches have the two properties below.

1. Each switch is greedy in that it only drops messages if a channel is congested, and
then only the minimum number necessary.

2. Each switch is oblivious in that decisions on which messages to drop are not based
on any knowledge of the message set other than the presence or absence of messages
on the switch's input lines.

We  define  the  switches  of  a  fat-tree  to  be  admissible  if  they  have  these  two  properties. 
The  conditions  are  satisfied,  for  example,  by  switches  that  drop  excess  messages  at 
random,  or  by  switches  that  favor  one  input  channel  over  another.  An  admissible  switch 
can  even  base  its  decisions  on  previous  decisions,  but  it  cannot  predict  the  future  or 
make  decisions  based  on  knowing  what  (or  how  many)  messages  it  or  other  switches 
have  dropped.  (The  definition  of  oblivious  in  property  2  can  be  weakened  to  include  an 
even  wider  range  of  switch  behaviors  without  substantially  affecting  our  results.) 
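
To make the notion of an admissible switch concrete, here is a minimal sketch of GREEDY
routing messages from the leaves out the root of a complete binary tree. The switch rule
(take the left input first and drop the excess) is one of the admissible behaviors
mentioned above. The function name `greedy_cycles`, the list-based representation, and
the example capacities are assumptions made for illustration only.

```python
def greedy_cycles(pending, capacity):
    """Count delivery cycles GREEDY needs to route every message out the root.

    pending[j]  -- number of undelivered messages at leaf j (length a power of 2)
    capacity[h] -- capacity of each channel leaving a node at height h
                   (h = 0 is the channel above a leaf; the last entry is the
                   root channel)
    Switch rule (illustrative): forward as many messages as the output channel
    holds, taking the left input first and dropping only the excess.
    """
    assert len(pending) == 2 ** (len(capacity) - 1)
    pending = list(pending)
    cycles = 0
    while any(pending):
        # Every leaf sends as many undelivered messages as its channel holds.
        level = [[j] * min(pending[j], capacity[0]) for j in range(len(pending))]
        # Merge pairs of input channels level by level, truncating to capacity.
        for cap in capacity[1:]:
            level = [(level[2 * i] + level[2 * i + 1])[:cap]
                     for i in range(len(level) // 2)]
        # Messages that crossed the root channel are delivered; the rest retry.
        for leaf in level[0]:
            pending[leaf] -= 1
        cycles += 1
    return cycles


# Two leaves with 3 messages each: a root channel of capacity 1 serializes them,
# while a root channel of capacity 2 never drops anything.
assert greedy_cycles([3, 3], [1, 1]) == 6
assert greedy_cycles([3, 3], [1, 2]) == 3
```

The switch in this sketch is greedy because it drops a message only when the output
channel is already full, and oblivious because its choice depends only on which inputs
carry messages in the current cycle.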

It is also important to realize that the lower-bound proof for the greedy strategy
which we will present does not apply to every possible choice of channel capacities in
the fat-tree. Our result is strong in the sense that it provides a lower bound on the time
required just to route messages from the leaves out the root, but it does not apply to
certain types of fat-trees. For example, on a fat-tree in which channel capacities double
at every level, there is never any congestion in routing from the leaves to the root, so a
greedy strategy is guaranteed to finish in $\lambda$ delivery cycles. Similarly, a fat-tree in which
all channel capacities are the same will also require only $\lambda$ delivery cycles. The lower
bound does apply to a wide variety of fat-trees which exhibit a substantial degree of
uniform and nonextreme growth. For the sake of simplicity, we shall consider fat-trees
like the one in Figure 1 in which the channel capacities double at every other level. As
discussed earlier, these fat-trees are universal. We also assume that the number of levels
($\lg n$) is even and that the capacities of the channels nearest the processors are 1. We
refer to such a fat-tree as a standard fat-tree.

We  are  now  ready  to  state  the  lower-bound  theorem  for  GREEDY.  At  this  point,  we 
restrict  attention  to  deterministic  strategies. 

Theorem 9  Consider an $n$-processor standard fat-tree with deterministic
admissible switches. Then there exist message sets with load factor $\lambda$ on which
GREEDY requires $\Omega(\lambda \lg n)$ delivery cycles.

Proof.  The  proof  is  by  induction  on  the  height  (lg  n)  of  the  fat-tree.  In  order  to  make 
the  induction  go  through,  we  first  strengthen  the  statement  of  the  theorem  as  follows: 

Claim: Let FT be an $n$-processor standard fat-tree, possibly embedded
within a larger fat-tree, with deterministic admissible switches. Then if
GREEDY is applied to routing messages out the root of FT, there exists, for
any $\lambda \ge 12$, a "bad" message set $M_n$ on FT which has the following three
properties:

1. The message set $M_n$ has load factor at most $\lambda$.

2. If at most $\frac{1}{12}\lambda\sqrt{n}$ messages of $M_n$ are removed from FT, then the root
channel is full for each of the delivery cycles during which the messages
are removed.

3. At least an additional $\frac{1}{24}\lambda \lg n$ delivery cycles are required to deliver all
the remaining messages in $M_n$.


For the base case we consider a tree with 1 processor, that is, one leaf connected
to a root channel of capacity 1. Then if we assign $\lambda$ messages to be sent from the
single processor, the root channel will remain congested throughout the removal of
$\frac{1}{12}\lambda$ messages, which will certainly leave us with additional messages requiring
additional delivery cycles. (Without loss of generality, we assume henceforth that
$\frac{1}{12}\lambda$ is integral, since we could otherwise use $\lfloor\frac{1}{12}\lambda\rfloor$ with only a constant factor change.)

Now we show that the claim is true for a standard fat-tree FT with $n$ processors
assuming that it is true for standard fat-trees with $n/4$ processors. We will construct
a message set $M_n$ for FT which satisfies properties 1, 2, and 3 by using an adversary
argument. We will first partially specify the pattern of inputs seen by the root switch
of FT. Then the root switch must indicate what its behavior is under these conditions.
Finally, we will use this information to determine a message set $M_n$ which is consistent
with the specified input pattern and which satisfies properties 1, 2, and 3.

We begin by specifying that the input channels of the root switch of FT are full for $t$
delivery cycles, where $t$ is $\frac{1}{12}\lambda$ plus the number of delivery cycles required to remove the
first $\frac{1}{12}\lambda\sqrt{n}$ messages from FT. Since the input channels are full for $t$ cycles, the
behavior of the oblivious switch during these cycles is determined. Since the root capacity
is $\sqrt{n}$, the total number of messages removed from FT during the first $t$ delivery cycles is
$m = \frac{1}{6}\lambda\sqrt{n}$.

The  behavior  of  the  root  switch  determines  how  many  of  the  m  messages  removed 
from  FT  by  delivery  cycle  t  come  from  each  of  the  four  subtrees  shown  in  Figure  6.  At 
least  one  of  these  subtrees  provides  no  more  than  m/4  of  the  messages.  We  choose  one 
such  subtree  and  refer  to  it  as  the  unfavored  subtree. 

Having determined the unfavored subtree given the conditions specified so far, we
can complete the construction of $M_n$. The unfavored subtree will contain a copy of the
bad message set $M_{n/4}$ for that subtree. Each of the other three subtrees will contain
$\frac{1}{6}\lambda\sqrt{n}$ messages evenly divided among the processors in the subtree. Now we must show
that $M_n$ meets all of our requirements.

First, we show that $M_n$ is consistent with the input pattern specified for the root
switch, and then we show that it satisfies properties 1, 2, and 3. As a preliminary step,
observe that the number of messages provided by the unfavored subtree by delivery cycle
$t$ is at most $m/4 = \frac{1}{6}\lambda\sqrt{n}/4 = \frac{1}{12}\lambda\sqrt{n/4}$, which is the threshold in property 2 for a
subtree with $n/4$ processors and which we shall use to invoke the induction hypothesis on
the subtree.

Figure 6: Construction of $M_n$ for the proof of Theorem 9. The root channel of FT has capacity $\sqrt{n}$.
The subtree from which the fewest messages have been delivered by a certain time is loaded with the
largest number of messages.

To  show  that  the  input  channels  of  the  root  switch  of  FT  are  full  through  the  first 
t  delivery  cycles,  it  suffices  to  show  that  the  root  channels  of  the  four  subtrees  are  full 
through  this  time.  The  root  channel  of  the  unfavored  subtree  is  full,  by  the  induction 
hypothesis  (property  2),  since  we  have  shown  that  the  number  of  messages  removed 
from  this  subtree  by  delivery  cycle  t  is  sufficiently  small.  The  root  channel  of  each  other 
subtree  is  also  full  since  it  is  the  source  of  at  most  m  messages  during  the  first  t  delivery 
cycles,  and  the  subtree’s  message  set  consists  of  m  messages  arranged  in  such  a  way  as 
to  maintain  a  full  root  channel  at  least  until  m  messages  have  been  delivered. 

We now show that the three properties hold for $M_n$. The load factor is at most $\lambda$ in
each of the subtrees, so the total number of messages in $M_n$ is at most

$$\tfrac{1}{2}\lambda\sqrt{n} + 3 \cdot \tfrac{1}{6}\lambda\sqrt{n} = \lambda\sqrt{n}.$$

Thus, the load factor of $M_n$ on FT is at most $\lambda$ and property 1 holds. Property 2 is
satisfied for $M_n$ because the root switch is greedy. We have already shown that the
input channels of the root switch are full through delivery cycle $t$, so the root channel
is certainly full for the required amount of time. Finally, property 3 holds because after
running $\frac{1}{12}\lambda$ delivery cycles, we can still invoke the induction hypothesis to conclude that
an additional $\frac{1}{24}\lambda \lg(n/4)$ cycles are required to empty the unfavored subtree. Thus the
total number of cycles required to deliver all the messages in $M_n$ is at least $\frac{1}{24}\lambda \lg n$. ∎

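
The constants in the inductive step fit together exactly, which a short check with
rational arithmetic makes visible. The function name `check_constants` is hypothetical,
and the sketch assumes $n$ is a power of 4 (so that $\sqrt{n}$ and $\sqrt{n/4}$ are integers) and
that $\lambda$ is a multiple of 12.

```python
from fractions import Fraction
import math


def check_constants(n, lam):
    """Verify the arithmetic of the inductive step; returns m = lam*sqrt(n)/6."""
    root_cap = math.isqrt(n)                     # capacity of the root channel
    sub_cap = math.isqrt(n // 4)                 # root capacity of each subtree
    m = Fraction(lam * root_cap, 6)

    # t = lam/12 plus the cycles needed to remove lam*sqrt(n)/12 messages
    # through a full root channel; m messages leave FT in those t cycles.
    t = Fraction(lam, 12) + Fraction(lam * root_cap, 12) / root_cap
    assert root_cap * t == m

    # The unfavored subtree's share m/4 matches property 2 for n/4 processors.
    assert m / 4 == Fraction(lam * sub_cap, 12)

    # Message count: a bad set of load factor lam in the unfavored subtree
    # (at most lam*sqrt(n/4) messages) plus m messages in each other subtree.
    assert lam * sub_cap + 3 * m == lam * root_cap

    # Property 3: lam/12 + (lam/24) lg(n/4) = (lam/24) lg n.
    lg_n = n.bit_length() - 1
    c = Fraction(1, 24)
    assert Fraction(lam, 12) + c * lam * (lg_n - 2) == c * lam * lg_n
    return m


assert check_constants(16, 12) == 8
```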
When probabilistic admissible switches are permitted, the proof of Theorem 9 can
be extended to show that the expected number of delivery cycles is $\Omega(\lambda \lg n)$. The
idea is that at least one of the subtrees in Figure 6 must be unfavored with probability
at least 1/4. We call one such subtree the often-unfavored subtree. The construction
of $M_n$ proceeds as before, with the often-unfavored subtrees playing the previous role
of the unfavored subtrees. In any particular run of GREEDY, we expect 1/4 of the
often-unfavored subtrees to be unfavored, so there is a $\Theta(1)$ probability that 1/8 of the
often-unfavored subtrees are unfavored (Lemma 4). Thus, the probability is $\Theta(1)$ that
$\Omega(\lambda \lg n)$ delivery cycles are required, which means that the expected number of delivery
cycles is $\Omega(\lambda \lg n)$.

Although we have shown an unfavorable comparison of GREEDY to RANDOM, it
should be noted that the lower bound we proved for routing messages out the root is
achievable. That is, routing of messages out the root or, more generally, up the tree
only, can be accomplished by GREEDY in $O(\lambda \lg n)$ delivery cycles. This can be seen
by observing that the highest congested channel (closest to the root) must drop at least
one level every $\lambda$ delivery cycles. If one could establish an upper bound of $\lambda$ times a
polylogarithmic factor for the overall problem of greedy routing, it would show that
GREEDY still has merit despite its inferior performance in comparison to RANDOM.

6  Further  results 

This  section  contains  additional  results  relevant  to  routing  on  fat-trees.  We  first  present 
an  improved  version  of  the  universality  theorem  from  [10].  Then  we  give  two  results  on 
fat-tree  routing  in  special  cases. 

Universality 

The performance of the routing algorithm RANDOM allows us to generalize the
universality theorem from [10], which states that a universal fat-tree of a given volume can
simulate any other routing network of equal volume with only a polylogarithmic factor
increase in the time required. The original proof assumed the simulation of the routing
was off-line. Our results show that the simulation can be carried out in the more
interesting on-line context.

Theorem 10  Let FT be a universal fat-tree of volume $v$, and let $R$ be an
arbitrary routing network also of volume $v$ on a set of $n = O(v / \lg^{3/2} v)$
processors. Then the processors of $R$ can be mapped to processors of FT such
that any message set $M$ that can be delivered in time $t$ by $R$ can be delivered
by FT in time $O((t + \lg\lg n) \lg^3 n)$ with probability $1 - O(1/n)$.

Sketch of proof. The proof parallels that of [10]. The reader is referred to that paper
for details. The routing network $R$ of volume $v$ is mapped to FT in such a way that
any message set $M$ that can be delivered in time $t$ by $R$ puts a load factor of at most
$O(t \lg(n/v^{2/3}))$ on FT. By Theorem 8, the message set $M$ can be delivered by RANDOM
in $O(t \lg(n/v^{2/3}) + \lg n \lg\lg n)$ delivery cycles with high probability. Since each delivery
cycle takes at most $O(\lg^2 n)$ time, the result follows. ∎
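
The final step of the sketch can be written out as a one-line calculation. It uses only
the fact that $\lg(n/v^{2/3}) \le \lg n$ for $v \ge 1$, and it takes the time per delivery cycle to be
$O(\lg^2 n)$, the value consistent with the bound stated in the theorem.

```latex
\[
  \underbrace{O\bigl(t \lg(n/v^{2/3}) + \lg n \lg\lg n\bigr)}_{\text{delivery cycles}}
  \cdot
  \underbrace{O(\lg^2 n)}_{\text{time per cycle}}
  \;=\; O\bigl((t \lg n + \lg n \lg\lg n)\,\lg^2 n\bigr)
  \;=\; O\bigl((t + \lg\lg n)\,\lg^3 n\bigr).
\]
```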


Off-line  routing 

Our analysis for RANDOM has repercussions for the off-line routing case. Since we have
shown that with high probability, the number of delivery cycles given by Figure 4 suffices
to deliver a message set with load factor $\lambda$, there must exist off-line schedules using only
this many delivery cycles, which improves the bound of $O(\lambda \lg n)$ given in [10]. The
previous off-line bound was proved by deterministically constructing a routing schedule
that achieves the bound. Our better bound does not yield a deterministic construction
of the routing schedule, but it does yield a probabilistic one.

Perhaps the bound on off-line routing can be further improved (e.g., to $O(\lambda + \lg n)$).
The integer programming framework of Raghavan and Thompson [13] is one possible
approach which might give a probabilistic construction that achieves this bound. On the
other hand, it may be possible to apply more direct combinatorial techniques to yield
an improved deterministic bound.

Larger  channel  capacities 

We can improve the results for on-line routing if each channel $c$ in the fat-tree is
sufficiently large, that is, if $\mathrm{cap}(c) = \Omega(\lg n)$. Specifically, we can deliver a message set $M$
in $O(\lambda(M))$ delivery cycles with high probability, i.e., we can meet the lower bound to
within a constant factor. The better bound is achieved by the algorithm RANDOM'
shown in Figure 7.

Theorem 11  For any message set $M$ on an $n$-processor fat-tree with channels
of capacity $\Omega(\lg n)$, the probability is at least $1 - O(1/n)$ that RANDOM'
will deliver all the messages of $M$ in $O(\lambda(M))$ delivery cycles, if $\lambda(M)$ is
polynomially bounded.

Proof. Let the lower bound on channel size be $a \lg n$, and let $n^k$ be the polynomial bound
on the load factor $\lambda(M)$. We consider only the pass of the algorithm when $z$ first exceeds
$e 2^{(k+2)/a} \lambda(M)$. We ignore previous cycles for the analysis of message routing, except to
note that the number of delivery cycles they require is $O(\lambda(M))$.

We first consider a single channel $c$ within a single cycle $i$ from among the $z$ delivery
cycles in the pass. Since each message has probability $1/z$ of being sent in cycle $i$, we
can apply Lemma 5 with $p = 1/z$ to conclude that the probability that channel $c$ is
congested in cycle $i$ is at most

$$2^{-\frac{k+2}{a}\mathrm{cap}(c)} \;\le\; 2^{-(k+2)\lg n} \;=\; \frac{1}{n^{k+2}}.$$

Since there are $O(n)$ channels, the probability that there exists a congested channel in
cycle $i$ is $O(1/n^{k+1})$. Finally, since there are $z \le 2e 2^{(k+2)/a} \lambda(M) = O(\lambda(M)) = O(n^k)$
cycles, the probability is $O(1/n)$ that there exists a congested channel in any delivery
cycle of the pass. ∎

1  z ← 1
2  while M ≠ ∅ do
3      for each message m ∈ M, choose a random number i_m ∈ {1, 2, ..., z}
4      for i ← 1 to z do
5          send all messages m such that i_m = i
6      endfor
7      z ← 2z
8  endwhile

Figure 7: The algorithm RANDOM' for routing in a fat-tree with channels of capacity $\Omega(\lg n)$. This
algorithm repeatedly doubles a guessed number of delivery cycles, z. For each guess, each message is
randomly sent in one of the delivery cycles.
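
A runnable toy version of RANDOM' may help fix the control flow. The sketch below is not
a fat-tree simulation: it assumes every message crosses a single channel of capacity
`cap`, and it uses the pessimistic rule from the analysis of RANDOM that all messages in
a congested channel are lost. The function name `random_prime` and the seeded generator
are illustrative choices.

```python
import random


def random_prime(num_messages, cap, seed=0):
    """Toy run of RANDOM' on one channel; returns the delivery cycles used.

    Assumptions (illustrative): all messages share one channel of capacity
    `cap`, and a cycle in which more than `cap` messages are sent delivers
    none of them.
    """
    rng = random.Random(seed)
    remaining = num_messages
    z, cycles = 1, 0
    while remaining > 0:
        # Each undelivered message picks one of the z cycles of this pass.
        counts = [0] * z
        for _ in range(remaining):
            counts[rng.randrange(z)] += 1
        # Run the z delivery cycles; an uncongested cycle delivers everything sent.
        for sent in counts:
            cycles += 1
            if sent <= cap:
                remaining -= sent
        z *= 2   # double the guessed number of delivery cycles
    return cycles


used = random_prime(num_messages=40, cap=4, seed=1)
assert used >= 10                 # at most cap messages are delivered per cycle
assert (used + 1) & used == 0     # passes have lengths 1, 2, 4, ..., so used = 2^j - 1
```

In this single-channel setting the load factor is `num_messages / cap`, and the run ends
in the first pass in which no cycle that still carries messages is congested.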


Figure 8: Another fat-tree design. The switches in this structure have constant size.

Another universal network

We have recently discovered a fat-tree design which uses simpler switches than the fat-tree
described in Section 1 and [10]. Figure 8 illustrates the structure of a two-dimensional
universal fat-tree of this new type. Each of the switches in this fat-tree can switch
messages among four child switches and two parent switches. The area of the fat-tree
is $O(n \lg^2 n)$.¹ In three dimensions, we can use switches with eight children and four
parents to obtain a fat-tree with volume $O(n \lg^{3/2} n)$.

The new fat-tree design satisfies the universality property of Theorem 10, except that
the degradation in time is $O(\lg^4 n)$. The new fat-tree structure removes a factor of $\lg n$
from the time to perform a delivery cycle since the switches have constant depth. The
number of delivery cycles needed to route a set $M$ of messages is $O(\lambda(M) \lg^2 n)$, however,
which yields $\lambda(M) \lg^3 n$ total time, as compared with $(\lambda(M) + \lg n \lg\lg n) \lg^2 n$ for the
original fat-tree.

¹ Interestingly, a mesh-of-trees [8] can be directly embedded in this fat-tree. In fact, it can be shown
using sorting arguments that a mesh-of-trees is area-universal [9].

The  mechanics  of  routing  on  the  new  fat-tree  are  somewhat  different  than  on  the 
original.  The  underlying  channel  structure  for  the  two  fat-trees  is  the  same,  but  the  new 
fat-tree  does  not  rely  on  concentrators  to  make  efficient  use  of  the  available  output  wires. 
Instead,  each  message  sent  through  the  fat-tree  randomly  chooses  which  parent  to  go  to 
next  (based  on  random  bits  embedded  in  its  address  field)  until  it  reaches  the  apex  of 
its  path,  and  then  it  takes  the  unique  path  downward  to  its  destination.  This  strategy 
guarantees  that  for  any  given  channel  through  which  a  message  must  pass,  the  message 
has  an  equal  likelihood  of  picking  any  wire  in  the  channel. 

The routing algorithm is a modification of the algorithm RANDOM'. We simply
surround lines 3-6 with a loop that executes these lines $(k + 1) \lg n$ times, where
$|M| = O(n^k)$.

The proof that the algorithm works applies the analysis from Section 4 to individual
wires, treating them as channels of capacity 1. Consider a wire $w$ traversed by a message
in a $p$-subset $M'$ of $M$, and consider the channel $c$ that contains the wire. For any
other message in $M$, the probability is $p/\mathrm{cap}(c)$ that the message is directed to wire $w$
when the message set $M'$ is sent. Thus, the probability that $w$ is congested is at most
$B(1, \mathrm{load}(M, c), p/\mathrm{cap}(c)) \le e p \lambda(M)$, and an analogue to Lemma 5 holds because the
capacity of $w$ is 1. Lemma 6, which says that the probability is at least $\frac{1}{2}p$ that a given
message of $M$ is delivered when a $p$-subset of $M$ is sent, also holds if the congestion
parameter $r$ is chosen to be $\Theta(\lg n)$.

We can now prove a bound of $O(\lambda(M) \lg^2 n)$ on the number of delivery cycles required
by the algorithm to deliver all the messages in $M$. It suffices to show that with high
probability, all the messages in $M$ get routed when the variable $z$ in the algorithm
reaches $\Theta(\lambda(M) \lg n)$. When $z \ge r\lambda(M) = \Theta(\lambda(M) \lg n)$, any given message $m$ is sent
once during a single pass through lines 3-6, and the probability that the message is not
delivered on that pass is at most $\frac{1}{2}$. Thus, the probability that $m$ is not delivered on
any of the $(k + 1) \lg n$ passes through lines 3-6 is at most $1/n^{k+1}$. Since the number of
messages in $M$ is $O(n^k)$, the probability is $O(1/n)$ that a message exists which is not
routed by the time $z$ reaches $r\lambda(M)$.
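
The cycle count for the modified algorithm can be tallied directly. The sketch below
assumes the structure described above: $z$ doubles from 1, and for each value of $z$ the
body of the loop (lines 3-6, costing $z$ delivery cycles) is repeated $(k + 1) \lg n$ times.
The function name `modified_cycles` and the sample parameters are hypothetical.

```python
import math


def modified_cycles(lam, n, k, r):
    """Cycles used through the pass in which z first reaches r * lam."""
    reps = (k + 1) * math.log2(n)    # repetitions of lines 3-6 for each z
    total, z = 0.0, 1
    while True:
        total += reps * z            # each repetition runs z delivery cycles
        if z >= r * lam:
            return total
        z *= 2


n, k, lam = 1024, 1, 3
r = 10                               # congestion parameter taken as lg n here
total = modified_cycles(lam, n, k, r)

# z stops below 2 * r * lam, so the geometric sum is below 4 * r * lam passes' worth.
assert total <= 4 * r * lam * (k + 1) * math.log2(n)

# A message fails all (k+1) lg n passes with probability at most n^-(k+1).
assert math.isclose(0.5 ** ((k + 1) * math.log2(n)), n ** -(k + 1))
```

With $r = \Theta(\lg n)$ the first assertion is the $O(\lambda(M) \lg^2 n)$ bound on delivery cycles.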

7  Concluding  remarks 

This paper has studied the problem of routing messages on fat-tree networks. We have
obtained good bounds for randomized routing based on the load factor of a set of
messages. Our algorithms directly address the problem of message congestion and require no
intermediate buffering, unlike many algorithms in the literature. We have shown how to
use the routing algorithms to prove that fat-trees are volume-universal networks. This
section discusses some directions for future research.

The analysis of the algorithm RANDOM gives reasonably tight asymptotic bounds
on its performance, but the constant factors in the analysis are large. In practice, smaller
constants probably suffice, but it is difficult to simulate the algorithm to determine what
constants might be better. Unlike Valiant's algorithm for routing on the hypercube, our
algorithm does not have the same probabilistic behavior on all sets of messages, and
therefore, the simulation results may be highly correlated with the specific message sets
chosen. The search for good constants is thus a multidimensional search in a large space,
where each data point represents an expensive simulation.

Although we have shown that GREEDY is asymptotically worse than RANDOM,
it may be that it is more practical to implement. The logarithmic-factor overhead that
we have been able to show is mitigated by a constant factor of $\frac{1}{24}$. Simulations indicate
that a greedy algorithm might actually work quite well [6], but we have been unable to
provide a good upper bound on its performance. Despite the simplicity of control offered
by GREEDY, it seems unwise to base the design of a large, parallel supercomputer
on unproven conjectures of performance. Thus, a comprehensive analysis of GREEDY
remains an important open problem.

The  idea  of  using  load  factors  to  analyze  arbitrary  networks  is  a  natural  one.  We 
have  been  successful  in  analyzing  fat-trees  using  this  measure  of  routing  difficulty.  It  may 
be  possible  to  analyze  other  networks  in  terms  of  load  factor,  but  some  improvement  to 
our  techniques  seems  to  be  necessary  if  channel  widths  are  small  and  the  diameter  of 
the  network  is  large.  The  problem  is  that  a  message  that  passes  through  many  small 
channels  has  a  high  likelihood  of  conflicting  with  other  messages.  One  solution  might 
involve  buffering  messages  in  intermediate  processors  or  switches. 

The high probability results reported in this paper for routing on fat-trees are almost
deterministic in the sense that substantial deviation from the expected performance will
probably never occur in one's lifetime. On the other hand, from a theoretical point of
view, it would be nice to match the results of this paper with truly deterministic
algorithms. Most deterministic routing algorithms in the literature are based on sorting, and
thus a direct application to fat-trees causes congestion problems, much as does Valiant's
routing technique. A deterministic routing algorithm for fat-trees that circumvents these
problems would yield even stronger universality properties than we have shown here.